Reinforcement Learning - An Implementation of Action Value Methods

Machine Learning Tokyo

MLT Slack - Join the #rl_book channel

Things covered

  • Epsilon Decay Strategy for Exploitation Vs Exploration Balancing
  • Initiating the Bandits of the k-Armed Bandit problem (test-bed)
  • Action Value methods
    • Simple methods e.g. using a sample averaging technique to evaluate best actions
    • Incremental Implementation e.g. refining the actions as we experiment
In [1]:
import numpy as np
In [2]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime

Epsilon Decay Strategy

This is a simple way of letting the agents balance their ability to Explore Vs Exploit

Exploitation

This is when we use the current knowledge of the Agent, to choose the next action.

Exploration

This is when we let the Agent randomly choose an action in order to learn better by trying to know the unknown :-P, well so to say ! :-)

We provide a coldness parameter (just an inverse of the temperature parameter usually used in ML/AI), which defines how quickly/coldly our negative exponential function decays. We present 3 different epsilon/decay/coldness factors for an example of 2000 episodes

In [3]:
def epsilon(episode,coldness, ep_min=0,ep_max=1):
    return ep_min + (ep_max - ep_min)*(np.exp(-coldness * episode))
In [4]:
fig = make_subplots(rows=1,cols=3)
fig.add_trace(go.Scatter(x=list(range(500)),
                         y=[epsilon(i,0.001) for i in range(500)],name="ep=0.001"),
             row=1,col=1)
fig.add_trace(go.Scatter(x=list(range(500)),
                         y=[epsilon(i,0.01) for i in range(500)],name="ep=0.01"),
             row=1,col=2)
fig.add_trace(go.Scatter(x=list(range(500)),
                         y=[epsilon(i,0.1) for i in range(500)],name="ep=0.1"),
             row=1,col=3)
fig.update_layout(title_text="Epsilon Decay for 2000 episodes")
fig.show()

K-Armed Bandit Test Bed

The K-Armed Bandit test bed has the following features (Note for simplicity for this notebook, we consider k=10 all through)

  • There are 10 possible actions, namely 0 ~ 9.
  • Every action has a mean reward value, and this mean reward is sampled from a normal/gaussian distribution of mean=0 and sigma=1. This can be seen from the bar plots presented below
  • For every action, with the mean reward values as built above, we further randomize the rewards that can be achieved by these actions. That is, the reward returned by chossing an an action is sampled from a normal/guassian distribution of mean = <built above> and sigma = 1. This can be seen from the bar plots presented below.
In [11]:
def bandit_mean_distribution(K):
    k_bandit_means = np.random.normal(0.0,1.0,K)
    return k_bandit_means
k_means = bandit_mean_distribution(10)
In [12]:
bandit_data = list()
for idx,m in enumerate(k_means):
    bandit_data.append(np.random.normal(m,1.0,1000).tolist())
In [13]:
fig = go.Figure()
for i in range(10):
    fig.add_trace(go.Box(y=bandit_data[i],name="bandit_"+str(i)))
fig.show()
Some helper functions
In [14]:
def get_new_q_dict():
    q_d = dict()
    for k in range(10):
        q_d[k]=dict()
        q_d[k]["q_value"] = 0
        q_d[k]["n_exec"] = 0
    return q_d
In [15]:
def execute(n_episodes,n_steps,decay,update_func):
    ep_explore = list()
    ep_exploit = list()
    ep_avg_reward = list()

    q_dict = get_new_q_dict()
    
    for ep in range(n_episodes):
        if (ep+1) % 500 == 0:
            print("Done with %d episodes" %((ep+1)))
        explore = 0
        exploit = 0
        reward = 0
        for t in range(n_steps):
            r = np.random.random()
            eps = epsilon(ep,decay)

            if r >= eps:
                # Exploit #
                exploit+=1
                action = np.argmax([q_dict[k]["q_value"] for k in range(10)])
                current_reward = np.random.choice(bandit_data[action])
                q_dict[action]["n_exec"] += 1
                q_dict[action]["q_value"] = update_func(q_dict[action]["q_value"],current_reward, q_dict[action]["n_exec"])
                reward += current_reward
            else:
                # Explore #
                explore+=1
                action = np.random.choice(range(10))
                current_reward = np.random.choice(bandit_data[action])
                q_dict[action]["n_exec"] += 1
                q_dict[action]["q_value"] = (q_dict[action]["q_value"] + current_reward)/q_dict[action]["n_exec"]
                reward += current_reward
        ep_explore.append(explore)
        ep_exploit.append(exploit)
        ep_avg_reward.append(reward/n_steps)
    return q_dict, ep_explore, ep_exploit, ep_avg_reward
In [16]:
def reward_graph(smallest, bigger, biggest,n_episodes):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(range(n_episodes)),y=smallest,name="ep=0.001"))
    fig.add_trace(go.Scatter(x=list(range(n_episodes)),y=bigger,name="ep=0.01"))
    fig.add_trace(go.Scatter(x=list(range(n_episodes)),y=biggest,name="ep=0.1"))
    fig.show()

Incremental Implementation

In this we update the Q(a) values in an incremental manner. Based on the formula below

New Estimate = Old Estimate + StepSize[Target - OldEstimate]

The above translates in to mathematical formulation as

New_Q = Q_n + (1/n)[R_n - Q_n]

In [22]:
def update_q_incremental(cur_value,cur_reward,n_exec):    # This all we need provide to the execute function to update q_values in the dict
    return cur_value + (1/n_exec)*(cur_reward - cur_value)

Performing evaluation with n_epsiodes = 2000, n_steps = 1000, epsilon_decay = 0.001

In [23]:
s_time = datetime.now()
inc_q_dict_smallest, \
inc_ep_explore_smallest, \
inc_ep_exploit_smallest, \
inc_ep_avg_reward_smallest = execute(2000,1000,0.001,update_q_incremental)
print("Total elapsed time : %s" %((datetime.now()-s_time).total_seconds()))
Done with 500 episodes
Done with 1000 episodes
Done with 1500 episodes
Done with 2000 episodes
Total elapsed time : 122.465848

Performing evaluation with n_epsiodes = 2000, n_steps = 1000, epsilon_decay = 0.01

In [24]:
s_time = datetime.now()
inc_q_dict_bigger, \
inc_ep_explore_bigger, \
inc_ep_exploit_bigger, \
inc_ep_avg_reward_bigger = execute(2000,1000,0.01,update_q_incremental)
print("Total elapsed time : %s" %((datetime.now()-s_time).total_seconds()))
Done with 500 episodes
Done with 1000 episodes
Done with 1500 episodes
Done with 2000 episodes
Total elapsed time : 123.207757

Performing evaluation with n_epsiodes = 2000, n_steps = 1000, epsilon_decay = 0.1

In [25]:
s_time = datetime.now()
inc_q_dict_biggest, \
inc_ep_explore_biggest, \
inc_ep_exploit_biggest, \
inc_ep_avg_reward_biggest = execute(2000,1000,0.1,update_q_incremental)
print("Total elapsed time : %s" %((datetime.now()-s_time).total_seconds()))
Done with 500 episodes
Done with 1000 episodes
Done with 1500 episodes
Done with 2000 episodes
Total elapsed time : 128.980846
In [26]:
reward_graph(inc_ep_avg_reward_smallest,inc_ep_avg_reward_bigger,inc_ep_avg_reward_biggest,2000)

Inferences

IMPORTANT NOTE Unlike what is given in the book, please remember that as we increase the epsilon_decay factor, we have LESSER EXPLORATION and MORE EXPLOITATION and visa-versa. Therefore, you would see that as we decrease the value of epsilon_decay, the rewards become much better (overall average rewards).

  • For ep = 0.1, since this decay is very strong, therefore, very less exploration happen, and hence, very less overall rewards (Green Line)
  • For ep = 0.01, since this decay value is lower, therefore, considerable exploration happens, and hence, we have higher rewards, as a matter of fact best (Red Line)
  • For ep = 0.001, since this decay value is very very low, essentially, the agents Explores more than what is required, and exploits lesser, leading to the fact that the rewards do not maximize in longer term.

Therefore, it is important to have exploration, however, it is also important for the model to exploit at later stages of episodes.

In [27]:
fig = make_subplots(rows=2,cols=1,
                    specs = [[{"secondary_y" : True}],
                             [{"secondary_y" : True}]])
fig.add_trace(
    go.Bar(x=list(range(10)),y=[ inc_q_dict_smallest[k]["n_exec"] for k in range(10)],name="n_exec"),
    secondary_y=False,
    row=1,col=1)
fig.add_trace(
    go.Scatter(x=list(range(10)),y=[ inc_q_dict_smallest[k]["q_value"] for k in range(10)],name="q_value"),
    secondary_y=True,
    row=1,col=1)
for i in range(10):
    fig.add_trace(go.Box(y=bandit_data[i],name="bandit_"+str(i)),row=2,col=1,secondary_y=False)
    
fig.update_layout(height=600,width=800)
fig.show()

Major Inference

  • We can clearly see that the agent has gathered the information, the action which had the highest mean reward value, it has sampled more actions from that that action. Leading to higher rewards.

This is indeed a unique inference, as without ability to look at the distribution of rewards for every action, the agent has been able to gather that information through experiments.

Seeing the Q-Value information

In [28]:
inc_q_dict_smallest # for epsilon = 0.001
Out[28]:
{0: {'q_value': 0.0006792578592940979, 'n_exec': 775414},
 1: {'q_value': 6.9115211894677785e-06, 'n_exec': 184155},
 2: {'q_value': -1.3253972661349099e-05, 'n_exec': 101813},
 3: {'q_value': -8.305293129540138e-06, 'n_exec': 132082},
 4: {'q_value': -1.3463638169271376e-05, 'n_exec': 90766},
 5: {'q_value': 1.8183768344094e-06, 'n_exec': 205888},
 6: {'q_value': -3.0938657769470836e-05, 'n_exec': 93430},
 7: {'q_value': -1.5934961827613188e-05, 'n_exec': 88178},
 8: {'q_value': 1.7699782101929494e-06, 'n_exec': 224995},
 9: {'q_value': -2.1428900914294525e-06, 'n_exec': 103279}}
In [29]:
inc_q_dict_bigger   # for epsilon = 0.01
Out[29]:
{0: {'q_value': 1.552173633184867, 'n_exec': 1647982},
 1: {'q_value': 1.946706073590723e-05, 'n_exec': 65871},
 2: {'q_value': 0.00012794047965750065, 'n_exec': 12139},
 3: {'q_value': 9.460522567973362e-05, 'n_exec': 18564},
 4: {'q_value': -0.00015986605810205595, 'n_exec': 10596},
 5: {'q_value': -1.0835391411171483e-06, 'n_exec': 128804},
 6: {'q_value': -4.9312527121881265e-05, 'n_exec': 11125},
 7: {'q_value': -0.00022553284723169517, 'n_exec': 10163},
 8: {'q_value': 3.521421783596986e-06, 'n_exec': 82254},
 9: {'q_value': 2.4852002088394332e-06, 'n_exec': 12502}}
In [30]:
inc_q_dict_biggest  # for epsilon = 0.1
Out[30]:
{0: {'q_value': 4.919822189660884e-05, 'n_exec': 38328},
 1: {'q_value': 0.17098618086892756, 'n_exec': 1906107},
 2: {'q_value': -0.0007682117633737084, 'n_exec': 1303},
 3: {'q_value': 2.8378627108308877e-05, 'n_exec': 1950},
 4: {'q_value': -0.0004085573045349638, 'n_exec': 1056},
 5: {'q_value': 0.00019306501852921123, 'n_exec': 5480},
 6: {'q_value': -0.0002139962814238395, 'n_exec': 1157},
 7: {'q_value': -0.0003094643694519499, 'n_exec': 1077},
 8: {'q_value': -7.094747453988671e-06, 'n_exec': 42271},
 9: {'q_value': -0.001146831041053481, 'n_exec': 1271}}

Inferences from Q-Values

  • We can see that with
    • ep = 0.001 : q_values are spread out, and there is no clear distinction, which is a good action. This is because of excessive exploration done by the model
    • ep = 0.01 : q_values are facoring Action 0, which has the highest mean rewards, and therefore seems to be best epsilon decay factor
    • ep = 0.1 : Agent tool maximum of Action 1, which had a lower q_value, because of lack of randomness. This clearly not a good epsilon decay strategy.
In [ ]: